Adaptive Optimal Control of Hybrid Electric Vehicle Power Battery via Policy Learning

An online policy learning algorithm is used, for the first time, to solve the optimal control problem of the power battery state of charge (SOC) observer. An adaptive neural network (NN) optimal control design is studied for the nonlinear power battery system based on a second-order resistance-capacitance (RC) equivalent circuit model. First, the unknown uncertainties of the system are approximated by an NN, and a time-varying gain nonlinear state observer is designed to address the problem that the RC voltages and SOC of the battery cannot be measured. Then, to realize the optimal control, a policy learning-based online algorithm is designed, in which only the critic NN is required and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the proposed optimal control method is verified by simulation.


Introduction
Nowadays, electric vehicles are developing at a high speed [1]. The power battery provides the high power required for vehicle start-stop, acceleration, deceleration, and other unstable operating conditions, and controlling its charging and discharging power greatly improves the service life of fuel cells [1,2]. As an important energy storage component of fuel-cell hybrid vehicles, the power battery is a research subject of far-reaching significance. The state of charge (SOC) of the battery is one of the important parameters of the battery management system (BMS), but SOC cannot be directly measured by on-board sensors. Therefore, SOC estimation is an important problem in both theory and application. Moreover, the power battery is a highly complex nonlinear system in its working state, which greatly increases the difficulty of estimation [3].
To meet the requirements of accurate, fast, and real-time estimation of power battery SOC under different conditions, scholars have reported many advanced results. In [4], the authors proposed an observer-based control method under the unilateral Lipschitz condition for a class of nonlinear systems with time-varying parameter uncertainties and norm-bounded disturbances. For the state-space equation of the equivalent circuit model, a power battery SOC estimation method based on a nonlinear observer was proposed in [5]. The authors in [6] introduced the second-order resistance-capacitance (RC) model of the battery pack. Under the unilateral Lipschitz condition, a nonlinear observer based on the H∞ method was designed, but whether the optimal performance of the observer can be guaranteed remains to be verified. For the optimal control design of observers, an adaptive neural network backstepping recursive optimal control method was proposed in [7] for nonlinear strict-feedback systems with state constraints. Neural network (NN) state identification is used to approximate the unknown nonlinear dynamics, and under the actor-critic structure, the virtual and actual optimal controllers are constructed through the backstepping recursive control algorithm. Because the actor-critic structure-based adaptive laws are derived by applying gradient descent to the square of the Bellman residual error, these methods are complex and difficult to implement. In this regard, the authors in [8] proposed an optimal control method based on reinforcement learning (RL) for a class of nonlinear strict-feedback systems with unknown dynamic functions. This method eliminates the persistent excitation assumption required by most RL-based adaptive optimal control methods.
On this basis, the adaptive NN output-feedback optimal control problem for a class of strict-feedback nonlinear systems with unknown internal dynamics, input saturation, and state constraints was studied in [9]. In [10,11], the authors proposed novel optimal control algorithms based on advanced AI techniques, which further promote the development of optimal control theory.
Inspired by the abovementioned results, a nonlinear observer with time-varying gain is designed in this paper. Based on the unilateral Lipschitz condition, the nonlinearity contained in the system output is handled. The unknown internal dynamic function is approximated by an NN in order to estimate the SOC and the RC voltages of the power battery. Then, based on the estimated system states, we develop a policy learning-based optimal control, and the weight estimation error converges to zero. Finally, the simulation results show the effectiveness of the proposed method.
The innovations of this paper are summarized as follows:
(1) The optimal control method based on a critic NN is used to solve the optimal control problem of the power battery SOC observer for the first time.
(2) Only one critic NN is used to ensure the convergence of the NN weights; thus, the actor NN widely used in most optimal control designs [12-14] is removed.
(3) Unlike existing optimal control methods with known states, the battery state in this paper is unknown. This leads to a complex optimal control problem.

System Modeling
In this paper, we consider the second-order RC equivalent circuit model shown in Figure 1 [15], where $U_{oc}$ is the open-circuit voltage (OCV) as a function of SOC, $I_T$ represents the current, $U_T$ denotes the terminal voltage, $R_0$ is the ohmic resistance, $R_1$ and $R_2$ are the electrochemical polarization resistance and the concentration polarization resistance, respectively, and $C_1$ and $C_2$ are the capacitances. $U_1$ and $U_2$ denote the voltages across the electrochemical polarization capacitor $C_1$ and the concentration polarization capacitor $C_2$, respectively. Then, based on Kirchhoff's voltage law, the state equation of Figure 1 can be given as (1), where $Q_n$ is the nominal capacity of the battery. Its output equation can be defined as (2), where $0 \le SOC \le 1$ and $U_{oc}(SOC)$ is a nonlinear, monotonically increasing function. Based on (1) and (2), we can obtain the state-space equation (3).
As the power battery is a highly complex nonlinear system in its working state, there are many unknown uncertainties such as ambient temperature, battery self-discharge, battery life, and cycle interval. Therefore, the state-space expression (3) can be expressed as (5), where $d(x)$ represents the nonlinear characteristics.
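As a rough illustration of the model structure described above, the following sketch integrates a second-order RC circuit with forward Euler. The sign convention (positive current discharges the battery), every parameter value, and the linear OCV curve `ocv` are assumptions made only for this example; they are not taken from this paper.

```python
import numpy as np

# Hypothetical parameters (not taken from the paper).
R0, R1, R2 = 0.01, 0.015, 0.02      # resistances in ohm
C1, C2 = 2000.0, 20000.0            # capacitances in farad
QN = 3600.0 * 2.0                   # nominal capacity in ampere-seconds (2 Ah)

def ocv(soc):
    """Assumed monotonically increasing open-circuit voltage curve (volts)."""
    return 3.2 + 0.9 * soc

def rc_step(x, current, dt):
    """One forward-Euler step of x = [U1, U2, SOC]; current > 0 means discharge."""
    u1, u2, soc = x
    du1 = -u1 / (R1 * C1) + current / C1
    du2 = -u2 / (R2 * C2) + current / C2
    dsoc = -current / QN
    return np.array([u1 + dt * du1, u2 + dt * du2, soc + dt * dsoc])

def terminal_voltage(x, current):
    """Terminal voltage: OCV minus both polarization voltages and the ohmic drop."""
    u1, u2, soc = x
    return ocv(soc) - u1 - u2 - R0 * current

def simulate(x0, current, dt=0.1, steps=6000):
    """Integrate the model under a constant current for steps * dt seconds."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = rc_step(x, current, dt)
    return x
```

For instance, `simulate([0.0, 0.0, 0.9], 1.0)` discharges the assumed cell at 1 A for 600 s; the fast polarization voltage settles near `R1 * current`, while the SOC decreases linearly.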
Assumption 1. In this paper, we assume that $(A, B)$ is stabilizable and $(A, C)$ is detectable. The nonlinear term $d(x)$ is continuous and bounded.
Control objective: for the second-order RC equivalent model of the power battery, a policy learning-based optimal controller is designed on the basis of an adaptive observer, such that all signals of the closed-loop system are uniformly ultimately bounded (UUB).
Figure 1: Schematic diagram of the second-order RC model.
According to the second-order RC model of the power battery, we can derive its state-space form (3) or (5); the control law $u$ is then designed for the derived state-space equation. To this end, we use the NN observer and the policy learning algorithm to design the control law $u$.

Optimal Control of Power Battery
3.1. Observer Design via NN. This section designs an observer to estimate the battery voltages and SOC. To this end, we assume that the unknown term $d(x)$ can be represented by an NN (6) with ideal weight $W_1$, an activation function, and an NN error $\varepsilon(x) \in \mathbb{R}$.
In this paper, the function $d(x)$ is unknown and continuous; hence, its estimate is given by (7), where $\hat{W}_1$ is the estimate of $W_1$. Then, based on (5) and (7), the observer can be designed as (8), where $\hat{x}$ is the estimate of $x$, $L = P^{-1} \in \mathbb{R}^{3\times 3}$ is the observation matrix, $P$ is a positive definite matrix, and $\hat{y}$ is the estimate of $y$.
We define the observation error $\tilde{x}$ between $x$ and its estimate $\hat{x}$ in (9). Then, from (5) and (8), we can obtain the observation error dynamic equation (10).
Lemma 2. For system (5), if the designed observer (8) is adopted and the NN weight $\hat{W}_1$ is updated by the adaptive law (11), then the errors $\tilde{x}$ and $\tilde{W}_1$ are UUB.
Proof. Consider a Lyapunov function (12), where $\alpha_{\min}$ and $\alpha_{\max}$ are the minimum and maximum values of the rate of change $\dot{U}_{oc}$, respectively. Then, differentiating (12) gives (13), where $M = [m_1, m_2, m_3]^T \in \mathbb{R}^3$. According to the unilateral Lipschitz condition [9], the inequalities (14) and (15) can be obtained.
Computational Intelligence and Neuroscience
Substituting (14) and (15) into (13) yields (16). Based on [8], (16) can be further written as (17), where $a_0 = \lambda_{\min}(\psi) - 3/2$ and $D_0$ is a constant involving $\|P\|^2$ and the NN weight. Moreover, by selecting an appropriate matrix $\psi$, $\lambda_{\min}(\psi)$ can be made relatively large. According to (17), the observation error converges to a small neighborhood of the origin.
where , and L is the Lyapunov function.
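The output-injection structure of the observer in this subsection can be illustrated with a much simpler stand-in. The sketch below is a fixed-gain Luenberger-type observer for a linearised model; it omits the NN term and the time-varying gain, and the matrices, the OCV slope `k`, and the hand-picked gain `L` (which is not $P^{-1}$) are all assumptions for the example.

```python
import numpy as np

# Hypothetical linearised model: x = [U1, U2, SOC], y = k*SOC - U1 - U2.
a1, a2 = 1.0 / 30.0, 1.0 / 400.0        # assumed 1/(R1*C1) and 1/(R2*C2)
A = np.diag([-a1, -a2, 0.0])
B = np.array([1.0 / 2000.0, 1.0 / 20000.0, -1.0 / 7200.0])
k = 0.9                                  # assumed OCV slope dUoc/dSOC
C = np.array([-1.0, -1.0, k])
L = np.array([0.0, 0.0, 0.5])            # hand-picked gain; A - L C is lower triangular

def observer_error_norms(x0, xhat0, current=1.0, dt=0.05, steps=4000):
    """Simulate plant and observer with forward Euler; return the error norm per step."""
    x, xhat = np.array(x0, float), np.array(xhat0, float)
    norms = []
    for _ in range(steps):
        y, yhat = C @ x, C @ xhat
        x = x + dt * (A @ x + B * current)
        # The observer copies the model and adds output injection L * (y - yhat).
        xhat = xhat + dt * (A @ xhat + B * current + L * (y - yhat))
        norms.append(np.linalg.norm(x - xhat))
    return np.array(norms)
```

With this choice the error dynamics are governed by `A - np.outer(L, C)`, whose eigenvalues are `-a1`, `-a2`, and `-0.5 * k`, so an initial SOC estimation error decays even though the SOC state itself is a pure integrator.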
To realize the optimal control, we first define the cost function, with $r(x, u) = x^T Q_s x + u^T R_s u$ being the utility function, where $Q_s \in \mathbb{R}^{3\times 3}$ and $R_s \in \mathbb{R}$ are weight matrices of proper dimensions.
We then define the Hamiltonian function of the optimal control problem and the optimal cost function. The optimal cost function $V^*(x)$ is the solution of the HJB equation. With $\nabla V^*(x) = \partial V^*(x)/\partial x$, we can obtain the optimal control action as $u^*(x) = -\frac{1}{2} R_s^{-1} B^T \nabla V^*(x)$, and the HJB equation can be expressed in terms of $\nabla V^*(x)$, with $V^*(0) = 0$.
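The stationarity step behind the optimal control action can be written out as follows. This is a sketch under the notational assumption that the controlled dynamics take the input-affine form $\dot{x} = f(x) + Bu$; the symbol $f(x)$ is introduced here only for illustration.

```latex
\begin{aligned}
H(x,u,\nabla V) &= x^{T} Q_s x + u^{T} R_s u + (\nabla V(x))^{T}\bigl(f(x) + B u\bigr),\\
\frac{\partial H}{\partial u} &= 2 R_s u + B^{T}\nabla V(x) = 0
\;\Longrightarrow\;
u^{*}(x) = -\tfrac{1}{2} R_s^{-1} B^{T}\nabla V^{*}(x),\\
0 &= x^{T} Q_s x + (\nabla V^{*}(x))^{T} f(x)
 - \tfrac{1}{4}(\nabla V^{*}(x))^{T} B R_s^{-1} B^{T}\nabla V^{*}(x).
\end{aligned}
```

The last line follows by substituting $u^*$ back into $H = 0$: the quadratic control cost contributes $+\tfrac{1}{4}$ and the cross term contributes $-\tfrac{1}{2}$ of the same quadratic form.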
This algorithm converges to the optimal control and the optimal cost function as the iteration index $i \to \infty$. Its convergence is analyzed in [16,17].
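To make the evaluate-then-improve structure of policy iteration concrete, the sketch below treats the linear-quadratic special case, where $V(x) = x^T P x$ and each policy evaluation reduces to a Lyapunov equation (Kleinman's iteration). This is an illustrative stand-in and not the online algorithm of this paper; the matrices used in the usage note are hypothetical.

```python
import numpy as np

def solve_lyapunov(Ac, Qc):
    """Solve Ac^T P + P Ac + Qc = 0 for P via a Kronecker-product linear system."""
    n = Ac.shape[0]
    I = np.eye(n)
    # Row-major vectorisation: vec(Ac^T P) + vec(P Ac) = M vec(P).
    M = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    P = np.linalg.solve(M, -Qc.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)  # symmetrise against round-off

def policy_iteration(A, B, Q, R, K0, iters=20):
    """Alternate policy evaluation and improvement, starting from a stabilising gain K0."""
    K = K0
    for _ in range(iters):
        Ac = A - B @ K
        P = solve_lyapunov(Ac, Q + K.T @ R @ K)   # policy evaluation
        K = np.linalg.solve(R, B.T @ P)           # policy improvement: u = -K x
    return P, K
```

Starting from any stabilising gain (for a stable `A`, `K0 = 0` suffices), the returned `P` approaches the solution of the algebraic Riccati equation, which can be checked by evaluating the Riccati residual.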

NN Implementation.
We assume that the cost function $V(x)$ is continuously differentiable. Then, we can use an NN to reconstruct $V(x)$ as $V(x) = W_2^T \sigma_c(x) + \varepsilon_c(x)$ (27), where $W_2 \in \mathbb{R}^N$ is the ideal NN weight vector, $\sigma_c(x) \in \mathbb{R}^n$ is the activation function, and $\varepsilon_c(x) \in \mathbb{R}$ denotes the NN error. Then, $\nabla V(x) = (\nabla\sigma_c(x))^T W_2 + \nabla\varepsilon_c(x)$ (28), where $\nabla\sigma_c(x) = \partial\sigma_c(x)/\partial x$ and $\nabla\varepsilon_c(x) = \partial\varepsilon_c(x)/\partial x$ are the gradients of the activation function and the NN error, respectively. According to (28), we can obtain the Lyapunov function (29).
Assumption 3 (see [12-14, 18]). If the NN weight $W_2$, the NN error $\varepsilon_c$, the gradient $\nabla\sigma_c$, and the derivative $\nabla\varepsilon_c$ are bounded, then we have $\varepsilon_c \to 0$ and $\nabla\varepsilon_c \to 0$.
We define the estimate of (27) as $\hat{V}(x) = \hat{W}_2^T \sigma_c(x)$ (30). Then, we have $\nabla\hat{V}(x) = (\nabla\sigma_c(x))^T \hat{W}_2$ (31), with $\nabla\hat{V}(x) = \partial\hat{V}(x)/\partial x$. Thus, the estimated Hamiltonian function can be given as $e_c = r(x, u) + \hat{W}_2^T \nabla\sigma_c(x)\dot{x}$ (32). To minimize the error (32), we construct the objective function $J = (1/2)e_c^T e_c$, and the gradient descent algorithm can then be designed as (33), with $\alpha_1 > 0$ being the learning gain of the NN. Based on (29), the Hamiltonian function can be rewritten as $r(x, u) + W_2^T \nabla\sigma_c(x)\dot{x} = e_h$ (34), where $e_h = -(\nabla\varepsilon_c(x))^T \dot{x}$ is the residual error. Define $\phi = \nabla\sigma_c(x)\dot{x}$ and suppose that there is a positive constant $\phi_M$ such that $\|\phi\| \le \phi_M$. Denoting the weight estimation error by $\tilde{W}_2 = W_2 - \hat{W}_2$, from (32) and (34) we have $e_h - e_c = \tilde{W}_2^T \phi$; thus, the dynamics of the weight estimation error are given by (35). The persistent excitation (PE) condition is required to tune the NN, guaranteeing $\|\phi\| \ge \phi_m$ with $\phi_m$ being a positive constant. To this end, a probing noise is inserted into the system to meet the PE condition.
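The gradient-descent tuning of the critic can be sketched numerically. The example below assumes the plain (unnormalised) gradient law $\dot{\hat{W}}_2 = -\alpha_1 \phi e_c$ with $e_c = r + \hat{W}_2^T\phi$, which may differ from the exact form of (33); the regressor, the synthetic utility signal, and all numerical values are hypothetical.

```python
import numpy as np

def critic_step(w_hat, phi, r, alpha, dt):
    """Euler step of the gradient law dW/dt = -alpha * phi * e_c, with e_c = r + W^T phi."""
    e_c = r + w_hat @ phi
    return w_hat - dt * alpha * phi * e_c

def run_critic(w_true, alpha=1.0, dt=0.01, steps=4000):
    """Tune a two-weight critic on synthetic data generated from w_true."""
    w_hat = np.zeros_like(w_true)
    for k in range(steps):
        t = k * dt
        phi = np.array([np.sin(t), np.cos(t)])   # persistently exciting regressor
        r = -w_true @ phi                        # utility consistent with zero residual e_h
        w_hat = critic_step(w_hat, phi, r, alpha, dt)
    return w_hat
```

Because the synthetic regressor keeps rotating, it excites both weight directions and `run_critic(np.array([1.5, -0.7]))` recovers the generating weights closely. Freezing `phi` at a constant vector would leave the component of the weight error orthogonal to it untouched, which is the role the PE condition plays in the analysis.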
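In this case, the optimal control action can be given as $u^* = -\frac{1}{2} R_s^{-1} B^T \bigl((\nabla\sigma_c(x))^T W_2 + \nabla\varepsilon_c(x)\bigr)$ (36), and its estimate is $\hat{u} = -\frac{1}{2} R_s^{-1} B^T (\nabla\sigma_c(x))^T \hat{W}_2$ (37). Equation (37) shows that, using the trained critic network, the control policy can be derived directly; thus, the actor NN is removed in this paper. The structural diagram of the algorithm is given in Figure 2.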
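A minimal sketch of how a control value is read directly off the critic weights, without an actor network, is given below. The quadratic basis `grad_sigma` is an assumed choice of activation functions for a three-state system, and `B` and `R_s` in the usage note are hypothetical.

```python
import numpy as np

def grad_sigma(x):
    """Jacobian (6x3) of the assumed basis [x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2]."""
    x1, x2, x3 = x
    return np.array([
        [2 * x1, 0.0, 0.0],
        [x2, x1, 0.0],
        [x3, 0.0, x1],
        [0.0, 2 * x2, 0.0],
        [0.0, x3, x2],
        [0.0, 0.0, 2 * x3],
    ])

def critic_control(x, w_hat, B, R_s):
    """u = -1/2 * R_s^{-1} B^T (grad sigma)^T w_hat for a scalar input weight R_s."""
    grad_v = grad_sigma(x).T @ w_hat   # gradient of the estimated cost at x
    return -0.5 * (B @ grad_v) / R_s
```

When the weights encode a quadratic cost $x^T P x$, that is, `w_hat = [P11, 2*P12, 2*P13, P22, 2*P23, P33]`, the function reduces to the familiar linear feedback $-R_s^{-1} B^T P x$.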

Lemma 4. For system (18), if the critic NN is tuned by the adaptive law (33), then the weight estimation error of the NN is UUB.
Proof. Choose a Lyapunov function $K(t)$ and take its time derivative along the trajectory of the error dynamics (35). After some basic manipulations, and considering the Cauchy-Schwarz inequality together with the assumption $\|\phi\| \le \phi_M$, we can conclude that $\dot{K}(t) < 0$ as long as $1 < \alpha_1 < 2$ and the accompanying bound condition is satisfied. According to Lyapunov theory, the dynamics of the weight estimation error are UUB. The norm of the weight estimation error is bounded as well.
It is noted that the estimated weight $\hat{W}_2$ approximates the ideal weight $W_2$, which indicates that the solution $V$ can be extracted from the estimated weight vector $\hat{W}_2$ given in (30). Thus, one can derive the actual control $u = -\frac{1}{2} R_s^{-1} B^T (\nabla\sigma_c(x))^T \hat{W}_2$ for system (18) based on $\hat{W}_2$. As a consequence of Lemma 4, we can conclude that $u$ converges to the optimal control $u^*$, i.e., $\|u - u^*\| \to 0$, such that the stability of the control system is retained. □
Remark 5. In this paper, an NN-based observer is designed to estimate the unknown state (SOC) online; then, based on the estimated state, we develop a policy learning algorithm to solve the optimal control problem of the battery online. The proposed method differs from our previous work, such as [18], where the system states are assumed to be known, which limits the application of the optimal control algorithm in practice.

Figure 2: Structural diagram of the algorithm (block labels: control action, controlled system, HJB equation).
Remark 6. To realize output-feedback control using policy learning, the PE condition is required in this paper. As shown in [14,17], one way to guarantee the PE condition is to insert an exploration noise into the system for the first two seconds [17].

Simulation Results
For the second-order RC equivalent model of the power battery, the effectiveness of the proposed optimal control method is verified by simulation in MATLAB. The values of the resistances, capacitances, and battery capacity in the second-order RC equivalent model (5) are fixed for the simulation. Letting $M = I$, we obtain $P$ and $L$. The design parameter in the learning law (33) is set to $\alpha_1 = 0.1$, together with the initial values. We aim to obtain an optimal control policy that stabilizes system (18); that is, we need to find a feedback control policy that minimizes the cost function with $Q_s = I$ and $R_s = 2I$.
We adopt the online policy iteration algorithm to tackle the optimal control problem, where a critic network is constructed to approximate the cost function. During the implementation of the policy learning algorithm, we introduce noise to meet the PE condition. An exponentially decreasing probing noise and sinusoidal signals with different frequencies are used; they are introduced into the control input and thus affect the system states. The evolution of the state trajectory is depicted in Figure 3, and this can be used to further design the optimal controller for the proposed system. Figure 4 shows the estimated weights, which have converged after 1000 s. Then, the probing signal is turned off. This convergence of the NN weights ensures the stability of the controlled system, as can be seen in Figure 5, which shows the trajectory of the controlled system under the designed optimal controller. We see that the states converge to zero after the probing noise is turned off. Figure 6 shows the cost of the system, which is smooth, and this indicates that the designed controller is effective. The control action is given in Figure 7 and is bounded. This is consistent with Lemma 4.
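One possible form of the probing signal described above is sketched below: a sum of sinusoids with different frequencies under an exponentially decaying envelope. The amplitude, decay rate, and frequencies are illustrative assumptions, not the values used in the reported simulation.

```python
import numpy as np

def probing_noise(t, amplitude=0.5, decay=0.01, freqs=(1.0, 3.0, 7.0, 11.0)):
    """Exponentially decaying sum of sinusoids; all parameter values are illustrative."""
    t = np.asarray(t, dtype=float)
    waves = sum(np.sin(f * t) for f in freqs)
    return amplitude * np.exp(-decay * t) * waves
```

Such a signal would be added to the control input during learning, for example `u_applied = u + probing_noise(t)`, and dropped once the critic weights have settled.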
To show the improved performance of the proposed single-critic NN-based adaptive dynamic programming (ADP) method for solving the derived optimal control problem, a critic-actor NN-based online learning method [19] is also used for comparison. Moreover, in this comparison, we verify the robustness of the proposed method. To this end, we set the nonlinear term $d(x) = 0.5\sin(x_1)$. The profiles of the critic NN and actor NN weights are shown in Figure 8, and the corresponding control performances are given in Figure 9. Comparing Figures 9(a) and 9(b), it is clear that the proposed single-critic NN-based method achieves faster transient state convergence even in the presence of the nonlinear term.
Generally, the modeling accuracy and control structure influence the control performance of closed-loop control systems. In this paper, the main factors affecting the control performance are the modeling uncertainties of the system and the convergence performance of the critic NN weights. Moreover, better convergence of the critic NN weights, i.e., faster convergence speed, helps to achieve better control performance. In this respect, different choices of critic NN parameters and structure affect the convergence of the critic NN weights and the control performance. Hence, proper selection of the NN parameters and structure, such as the initial values of the weights, the learning gain, and the regressor structure, helps to further improve the control response.

Conclusion
For the second-order RC equivalent nonlinear system of the power battery, the unknown uncertainty of the system is approximated by an NN, and a time-varying gain nonlinear state observer is designed to solve the problem that the RC voltages and state of charge (SOC) of the battery cannot be measured. Then, to realize the optimal control, a policy learning-based online algorithm is designed, in which only the critic NN is required and the actor NN widely used in most optimal control designs is removed. Finally, the effectiveness of the proposed optimal control method is verified by simulation.

Data Availability
The data used to support the findings of this study are available upon request from the corresponding author.